Ayodeji: Implemented blocked and parallel matrix multiplication by Deji10 · Pull Request #9 · AA-parallel-computing/Assignment-4-Optional

Deji10 · 2026-06-01T13:26:51Z

Implementation of blocked and parallel matrix multiplication for Assignment 4.

Implementation

naive_matmul: Standard triple-nested loop baseline
blocked_matmul: Cache-blocked (tiled) version; block_size defaults to 64, although 16 was the fastest size measured in the block-size experiment
parallel_matmul: OpenMP-based with #pragma omp parallel for collapse(2)
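For readers who have not opened the diff, the three kernels can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the signatures, the row-major `std::vector<float>` layout, and the choice to apply the pragma to the plain (non-blocked) loop nest are all assumptions. Only the function names, the `block_size` default of 64, and the `collapse(2)` pragma come from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout: row-major, A is m x k, B is k x n, C is m x n.

// Baseline: one dot product per output element.
void naive_matmul(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int m, int k, int n) {
    C.assign(static_cast<std::size_t>(m) * n, 0.0f);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p)
                sum += A[i * k + p] * B[p * n + j];
            C[i * n + j] = sum;
        }
}

// Blocked: walk the matrices tile by tile so each tile stays in cache
// while it is reused. std::min handles tiles that overhang the edges.
void blocked_matmul(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int m, int k, int n,
                    int block_size = 64) {
    assert(block_size > 0);
    C.assign(static_cast<std::size_t>(m) * n, 0.0f);
    for (int ii = 0; ii < m; ii += block_size)
        for (int jj = 0; jj < n; jj += block_size)
            for (int pp = 0; pp < k; pp += block_size) {
                const int i_end = std::min(ii + block_size, m);
                const int j_end = std::min(jj + block_size, n);
                const int p_end = std::min(pp + block_size, k);
                for (int i = ii; i < i_end; ++i)
                    for (int p = pp; p < p_end; ++p) {
                        const float a = A[i * k + p];
                        for (int j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[p * n + j];
                    }
            }
}

// Parallel: collapse(2) fuses the i and j loops into one iteration space
// of m * n independent output elements, which OpenMP splits across threads.
// Without -fopenmp the pragma is ignored and this runs serially.
void parallel_matmul(const std::vector<float>& A, const std::vector<float>& B,
                     std::vector<float>& C, int m, int k, int n) {
    C.assign(static_cast<std::size_t>(m) * n, 0.0f);
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p)
                sum += A[i * k + p] * B[p * n + j];
            C[i * n + j] = sum;
        }
}
```

Each output element is written by exactly one iteration of the collapsed loop, so the parallel version needs no locking or reduction clause.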

All timings are averaged over 5 independent runs to reduce measurement noise
Three additional modes added: default, blocks (block-size sweep), threads (thread-count sweep)
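The averaging described above could look something like the helper below. It is a sketch under assumptions: `average_ms` is a hypothetical name, and the PR may well time its kernels differently (for example with `omp_get_wtime`). Only the run count of 5 is taken from the description.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: mean wall-clock time, in milliseconds, of `runs`
// back-to-back calls to `fn`. steady_clock is used because it is monotonic.
double average_ms(const std::function<void()>& fn, int runs = 5) {
    using Clock = std::chrono::steady_clock;
    double total_ms = 0.0;
    for (int r = 0; r < runs; ++r) {
        const auto start = Clock::now();
        fn();
        const auto stop = Clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
    }
    return total_ms / runs;
}
```

A block-size sweep would then call this once per candidate size, passing a lambda that runs the blocked kernel with that size, and report each mean relative to the naive mean.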

All 10 test cases pass correctness validation (output.raw, tolerance 1e-2)
Block size experiment: tested 16/32/64/128 → block size 16 gives best speedup (2.33×) on Case 7; the conventional default of 64 was not optimal
Thread count experiment: tested 1/2/4/8 → 2 threads optimal (1.52×) on the 2-core Codespaces environment; 8 threads slower than 1 due to scheduling overhead
Combined optimal (block 16 + 2 threads) gives ~3× over naive
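The correctness check against output.raw can be pictured as an element-wise comparison like the one below. This is an illustrative sketch: `matches_reference` is a hypothetical name, and treating the 1e-2 tolerance as an absolute per-element difference is an assumption, since the description does not say whether the tolerance is absolute or relative.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical check: true when both buffers have the same length and every
// element differs from the reference by at most `tol` (absolute difference).
// The negated <= form also rejects NaN values.
bool matches_reference(const std::vector<float>& result,
                       const std::vector<float>& expected,
                       float tol = 1e-2f) {
    if (result.size() != expected.size()) return false;
    for (std::size_t i = 0; i < result.size(); ++i) {
        if (!(std::fabs(result[i] - expected[i]) <= tol)) return false;
    }
    return true;
}
```

A tolerance is needed because the blocked kernel accumulates partial sums in a different order from the naive one, so floating-point results can differ slightly even when both are correct.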

Full performance tables and analysis available in README.md.

The local Windows machine had no g++ installed, so the work was completed in GitHub Codespaces
Codespaces 2-core CPU limit caps achievable parallel speedup
Provided test cases are small (largest 256 × 300); larger matrices would showcase parallelism better
Submitted late due to setup issues

Co-authored with Muhammad Zahid.

Co-authored-by: Muhammad Zahid <muhammad.zahid@example.com>

ayodeji-ibrahim: Implemented blocked and parallel matrix multiplication

29aebfc

Co-authored-by: Muhammad Zahid <muhammad.zahid@example.com>

Deji10 force-pushed the ayodeji-ibrahim branch from 7656d49 to 29aebfc on June 1, 2026 13:55